This document contains instructions on Project 1 for STA 207 in Winter 2021. This document is made with R markdown. The rmd file to generate this document is available on the course website.
In this project, we study the dataset from a very influential randomized experiment. The Tennesses Student/Teacher Achievement Ratio study (a.k.a. Project STAR) was conducted in the late 1980s to evaluate the effect of class sizes on test scores. This dataset has been used as a classic examples in many textbooks and research papers. You are encouraged to read more about the experiment design and how others analyze this dataset. This document only provides a brief explanation of the dataset for this course project.
The study randomly assigned students to small classes, regular classes, and regular classes with a teacher’s aide. In order to randomize properly, schools were enrolled only if they had enough student body to have at least one class of each type. Once the schools were enrolled, students were randomly assigned to the three types of classes, and one teacher was randomly assigned to each class.
The dataset contains scaled scores for math and reading from kindergarten to 3rd grade. We will only examine the math scores in 1st grade in this project. The primary question of interest is whether there is any differences in math scaled scores in 1st grade across class types, and if so, a secondary question of interest is which class type is associated with the highest math scaled scores in 1st grade. In particular, we will treat each teacher as the basic unit of our analysis. To put it in another way, we will treat each class (uniquely identified by its assigned teacher) as an observation. Noting that there are multiple students in each class, some data manipulation are warranted.
The following list provides one potential structure of the data analysis report. A detailed template for this project is provided in a separate RMD file. The detailed template is provided as a learning tool for the first data analysis project in this course. There will not be any templates for future projects.
library(ggplot2)
library(MASS)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble 3.0.4 ✓ purrr 0.3.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::select() masks MASS::select()
“The dataset contains scaled scores for math and reading from kindergarten to 3rd grade. We will only examine the math scores in 1st grade in this project. The primary question of interest is whether there is any differences in math scaled scores in 1st grade across class types, and if so, a secondary question of interest is which class type is associated with the highest math scaled scores in 1st grade. In particular, we will treat each teacher as the basic unit of our analysis. To put it in another way, we will treat each class (uniquely identified by its assigned teacher) as an observation. Noting that there are multiple students in each class, some data manipulation are warranted.”
students = read.csv('STAR_Students.tab', sep='\t')
#head(students)
#colnames(students)
length(rownames(students)) # total number of observations
## [1] 11601
students_bu = students
students$g1classtype = as.integer(students$g1classtype)
students = students %>% filter(!is.na(g1tmathss))
students = students %>% filter(!is.na(g1classtype))
students = students %>% filter(!is.na(g1schid))
#students$g1mathss = students[!is.na(students$g1mathss)]
#pairwise.t.test(x=students$g1mathss, g=students$g1classtype, p.adjust.method = 'bonf')
#head(students)
length(rownames(students))
## [1] 6598
teach_agg = aggregate(list(students$g1tmathss, students$g1schid, students$g1classtype, students$g1classsize),
by = list(students$g1tchid),
FUN = mean)
colnames(teach_agg) = c('g1tchid', 'g1tmathss', 'g1schid', 'g1classtype', 'g1classsize' )
#teach_agg$class_type = teach_class_agg$x
head(teach_agg)
## g1tchid g1tmathss g1schid g1classtype g1classsize
## 1 11203804 486.8750 112038 3 28
## 2 11203805 499.8235 112038 1 17
## 3 11203806 496.0370 112038 2 28
## 4 12305604 523.7391 123056 2 25
## 5 12305605 532.8000 123056 3 27
## 6 12305606 534.4706 123056 1 19
#boxplot(students$g1mathbsobjpct, as.factor(students$g1classtype))
hist(teach_agg$g1tmathss)
summary(as.factor(teach_agg$g1classtype))
## 1 2 3
## 124 115 100
teach_agg$g1schid = as.factor(teach_agg$g1schid)
teach_agg$g1classtype = as.factor(teach_agg$g1classtype)
ggplot(teach_agg, aes(x=g1classtype, y=g1tmathss, fill=g1tmathss)) +
geom_boxplot(alpha=0.7) +
stat_summary(fun=mean, geom="point", shape=20, size=5, color="blue", fill="black") +
labs(title='Class Type vs Math Scaled Score', x='Class Type', y='Math Scaled Score')
ggplot(teach_agg, aes(x=g1schid, y=g1tmathss, fill=g1tmathss)) +
geom_boxplot(alpha=0.7) +
stat_summary(fun=mean, geom="point", shape=20, size=2, color="blue", fill="black") +
labs(title='School ID vs Math Scaled Score', x='School ID', y='Math Scaled Score') +
theme(axis.text.x = element_text(angle = 90))
# Main effects plot
plotmeans(g1tmathss~g1classtype,data=teach_agg,xlab="Class Type",ylab="Math Scaled Score", main="Main effect, Class Type")
# Main effects plot
?plotmeans
plotmeans(g1tmathss~g1schid,data=teach_agg,n.label=FALSE,xlab="School ID",ylab="Math Scaled Score", main="Main effect, School ID")
## Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped
## Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped
## Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped
## Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped
## Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped
## Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped
## Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped
## Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped
## Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped
## Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, : zero-
## length arrow is of indeterminate angle and so skipped
The null hypothesis for the primary question of interest is \(H_0 : \alpha_1 = \alpha_2 = \alpha_3 = 0\), and the alternative is \(H_a\) : not all \(\alpha\)s are zero. You can find the test statistic and p-value using summary(anova.fit), if you save your fitted model as anova.fit. Please be sure specify the significance level and interpret your test result. Explain any additional assumptions involved in this test.
length(unique(teach_agg$g1schid))
## [1] 76
We can define a two-way ANOVA model as follows
\(Y_{ijk} = \mu_{..} + \alpha_{i} + \beta_{j} + \epsilon_{ijk}\),
where the index \(i\) represents the class type: small (\(i=1\)), regular (\(i=2\)), regular with aide (\(i=3\)),
and the index \(j\) represents the school indicator \(j=1....76\) based on the result shown above. .
\(\alpha_i\) is the main effect of the factor class type.
\(\beta_j\) is the main effect of the factor school ID.
You need to explain the rest of the parameters, state constraints on the parameters, and justify the choice of model (e.g., why no interaction terms).
hist(students$g1tmathss)
fit_tch_sch = lm(g1tmathss ~ g1schid + g1classtype, data = teach_agg)
anova(fit_tch_sch)
## Analysis of Variance Table
##
## Response: g1tmathss
## Df Sum Sq Mean Sq F value Pr(>F)
## g1schid 75 136424 1819.0 6.5733 < 2.2e-16 ***
## g1classtype 2 12026 6013.1 21.7298 1.866e-09 ***
## Residuals 261 72225 276.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit = aov(g1tmathss ~ g1classtype + g1schid + g1classtype:g1schid, data = teach_agg)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## g1classtype 2 11617 5809 19.936 3.69e-08 ***
## g1schid 75 136833 1824 6.262 < 2e-16 ***
## g1classtype:g1schid 146 38718 265 0.910 0.706
## Residuals 115 33507 291
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit = aov(g1tmathss ~ g1classtype + g1schid, data = teach_agg)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## g1classtype 2 11617 5809 20.991 3.52e-09 ***
## g1schid 75 136833 1824 6.593 < 2e-16 ***
## Residuals 261 72225 277
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fit)
boxcox(fit)
#?boxcox
fit2 = aov(g1tmathss^-2 ~ g1classtype + g1schid, data = teach_agg)
summary(fit2)
## Df Sum Sq Mean Sq F value Pr(>F)
## g1classtype 2 1.967e-12 9.833e-13 20.588 4.99e-09 ***
## g1schid 75 2.512e-11 3.350e-13 7.014 < 2e-16 ***
## Residuals 261 1.247e-11 4.780e-14
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
test=pairwise.t.test(teach_agg$g1tmathss^-2, teach_agg$g1classtype)
test$p.value
## 1 2
## 2 0.0001606678 NA
## 3 0.0143422668 0.2251094
pairwise.t.test(teach_agg$g1tmathss^-2, teach_agg$g1classtype, p.adjust.method = 'bonf')
##
## Pairwise comparisons using t tests with pooled SD
##
## data: teach_agg$g1tmathss^-2 and teach_agg$g1classtype
##
## 1 2
## 2 0.00016 -
## 3 0.02151 0.67533
##
## P value adjustment method: bonferroni
pairwise.t.test(teach_agg$g1tmathss^-2, teach_agg$g1classtype, p.adjust.method = 'BH')
##
## Pairwise comparisons using t tests with pooled SD
##
## data: teach_agg$g1tmathss^-2 and teach_agg$g1classtype
##
## 1 2
## 2 0.00016 -
## 3 0.01076 0.22511
##
## P value adjustment method: BH
alpha=0.05;
T.ci=TukeyHSD(fit2,conf.level = 1-alpha)
# The highest wage is in management, and the second largest is in technical
get("g1classtype", T.ci)
## diff lwr upr p adj
## 2-1 1.771954e-07 1.105047e-07 2.438860e-07 4.640207e-09
## 3-1 1.216169e-07 5.237978e-08 1.908539e-07 1.382845e-04
## 3-2 -5.557851e-08 -1.260147e-07 1.485769e-08 1.525267e-01
par(mfrow=c(2,2))
plot(fit2)
boxcox(fit2)
my_star <- function(mathvar, classtypevar, schoolvar, teachervar){
students = read.csv('STAR_Students.tab', sep='\t')
#students$g1classtype = as.integer(students$c)
students = students %>% filter(!is.na(!!as.symbol(mathvar)))
students = students %>% filter(!is.na(!!as.symbol(teachervar)))
students = students %>% filter(!is.na(!!as.symbol(schoolvar)))
students = students %>% filter(!is.na(!!as.symbol(classtypevar)))
teach_agg <- aggregate(list(students[,mathvar], students[,schoolvar], students[,classtypevar]),
by = list(students[,teachervar]),
FUN = mean)
colnames(teach_agg) = c(teachervar, mathvar, schoolvar, classtypevar)
teach_agg$classtypevar = as.factor(teach_agg[,classtypevar])
teach_agg$schoolvar = as.factor(teach_agg[,schoolvar])
teach_agg$mathvar = teach_agg[,mathvar]
print(head(teach_agg))
fit_my_star = aov(mathvar ~ classtypevar + schoolvar, data = teach_agg)
return(teach_agg)
# transformation should be considered for every scenario seperately
}
my_star_stat<- function(agg, power,fit){
print(summary(fit))
par(mfrow=c(2,2))
plot(fit)
# Pairwise t test with different correction methods
print(pairwise.t.test(teach_agg$mathvar^power, teach_agg$classtypevar)$p.value)
print(pairwise.t.test(teach_agg$mathvar^power, teach_agg$classtypevar, p.adjust.method = 'bonf')$p.value)
print(pairwise.t.test(teach_agg$mathvar^power, teach_agg$classtypevar, p.adjust.method = 'BH')$p.value)
# One approach is to use the Tukey’s method to construction simultaneous confidence intervals for all pairwise comparisons.
alpha=0.05;
T.ci=TukeyHSD(fit,conf.level = 1-alpha)
print(T.ci$`teach_agg$classtypevar`)
}
teach_agg = my_star('g1tmathss', 'g1classtype', 'g1schid', 'g1tchid')
## g1tchid g1tmathss g1schid g1classtype classtypevar schoolvar mathvar
## 1 11203804 486.8750 112038 3 3 112038 486.8750
## 2 11203805 499.8235 112038 1 1 112038 499.8235
## 3 11203806 496.0370 112038 2 2 112038 496.0370
## 4 12305604 523.7391 123056 2 2 123056 523.7391
## 5 12305605 532.8000 123056 3 3 123056 532.8000
## 6 12305606 534.4706 123056 1 1 123056 534.4706
fit = aov(teach_agg$mathvar ~ teach_agg$classtypevar + teach_agg$schoolvar);
boxcox(fit)
new_fit = aov(teach_agg$mathvar^-2 ~ teach_agg$classtypevar + teach_agg$schoolvar)
boxcox(new_fit)
my_star_stat(teach_agg, -2, new_fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## teach_agg$classtypevar 2 1.967e-12 9.833e-13 20.588 4.99e-09 ***
## teach_agg$schoolvar 75 2.512e-11 3.350e-13 7.014 < 2e-16 ***
## Residuals 261 1.247e-11 4.780e-14
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 2
## 2 0.0001606678 NA
## 3 0.0143422668 0.2251094
## 1 2
## 2 0.0001606678 NA
## 3 0.0215134003 0.6753281
## 1 2
## 2 0.0001606678 NA
## 3 0.0107567001 0.2251094
## diff lwr upr p adj
## 2-1 1.771954e-07 1.105047e-07 2.438860e-07 4.640207e-09
## 3-1 1.216169e-07 5.237978e-08 1.908539e-07 1.382845e-04
## 3-2 -5.557851e-08 -1.260147e-07 1.485769e-08 1.525267e-01
teach_agg = my_star('g2tmathss', 'g1classtype', 'g1schid', 'g1tchid')
## g1tchid g2tmathss g1schid g1classtype classtypevar schoolvar mathvar
## 1 11203804 546.4286 112038 3 3 112038 546.4286
## 2 11203805 549.0000 112038 1 1 112038 549.0000
## 3 11203806 558.0000 112038 2 2 112038 558.0000
## 4 12305604 591.6667 123056 2 2 123056 591.6667
## 5 12305605 598.8000 123056 3 3 123056 598.8000
## 6 12305606 580.2500 123056 1 1 123056 580.2500
fit = aov(teach_agg$mathvar ~ teach_agg$classtypevar + teach_agg$schoolvar);
boxcox(fit)
new_fit = aov(teach_agg$mathvar^-1 ~ teach_agg$classtypevar + teach_agg$schoolvar)
boxcox(new_fit)
my_star_stat(teach_agg, -1, new_fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## teach_agg$classtypevar 2 6.390e-08 3.195e-08 12.764 5.22e-06 ***
## teach_agg$schoolvar 74 1.205e-06 1.629e-08 6.507 < 2e-16 ***
## Residuals 254 6.357e-07 2.500e-09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 2
## 2 0.00259797 NA
## 3 0.13809623 0.1644885
## 1 2
## 2 0.00259797 NA
## 3 0.20714435 0.4934654
## 1 2
## 2 0.00259797 NA
## 3 0.10357218 0.1644885
## diff lwr upr p adj
## 2-1 3.301297e-05 1.755248e-05 4.847345e-05 2.718957e-06
## 3-1 1.860550e-05 2.547198e-06 3.466381e-05 1.844975e-02
## 3-2 -1.440746e-05 -3.068783e-05 1.872898e-06 9.468932e-02
teach_agg = my_star('g3tmathss', 'g1classtype', 'g1schid', 'g1tchid')
## g1tchid g3tmathss g1schid g1classtype classtypevar schoolvar mathvar
## 1 11203804 610.0000 112038 3 3 112038 610.0000
## 2 11203805 590.7500 112038 1 1 112038 590.7500
## 3 11203806 609.3333 112038 2 2 112038 609.3333
## 4 12305604 619.8462 123056 2 2 123056 619.8462
## 5 12305605 634.8421 123056 3 3 123056 634.8421
## 6 12305606 618.6154 123056 1 1 123056 618.6154
fit = aov(teach_agg$mathvar ~ teach_agg$classtypevar + teach_agg$schoolvar);
boxcox(fit)
new_fit = aov(log(teach_agg$mathvar) ~ teach_agg$classtypevar + teach_agg$schoolvar)
boxcox(new_fit)
# do this one custom
teach_agg$g3tmathss = log(teach_agg$g3tmathss)
my_star_stat(teach_agg, 1, new_fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## teach_agg$classtypevar 2 0.01187 0.005933 10.355 4.74e-05 ***
## teach_agg$schoolvar 74 0.26348 0.003560 6.214 < 2e-16 ***
## Residuals 256 0.14667 0.000573
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 2
## 2 0.01969451 NA
## 3 0.02123816 0.9504549
## 1 2
## 2 0.01969451 NA
## 3 0.03185724 1
## 1 2
## 2 0.01592862 NA
## 3 0.01592862 0.9504549
## diff lwr upr p adj
## 2-1 -0.0125013238 -0.019868577 -0.005134070 0.0002438196
## 3-1 -0.0122564119 -0.019910777 -0.004602046 0.0005818623
## 3-2 0.0002449119 -0.007544054 0.008033877 0.9969753167
teach_agg = my_star('g2tmathss', 'g2classtype', 'g2schid', 'g2tchid')
## g2tchid g2tmathss g2schid g2classtype classtypevar schoolvar mathvar
## 1 11203807 563.3500 112038 2 2 112038 563.3500
## 2 11203808 548.0000 112038 3 3 112038 548.0000
## 3 11203809 562.1333 112038 1 1 112038 562.1333
## 4 12305607 595.4500 123056 2 2 123056 595.4500
## 5 12305608 600.3810 123056 3 3 123056 600.3810
## 6 12305609 570.1538 123056 1 1 123056 570.1538
fit = aov(teach_agg$mathvar ~ teach_agg$classtypevar + teach_agg$schoolvar);
boxcox(fit)
new_fit = aov(teach_agg$mathvar^1 ~ teach_agg$classtypevar + teach_agg$schoolvar)
boxcox(new_fit)
my_star_stat(teach_agg, 1, new_fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## teach_agg$classtypevar 2 5990 2994.8 9.541 1e-04 ***
## teach_agg$schoolvar 73 138218 1893.4 6.032 <2e-16 ***
## Residuals 259 81299 313.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 2
## 2 0.03040029 NA
## 3 0.03040029 0.9159568
## 1 2
## 2 0.03040029 NA
## 3 0.03699641 1
## 1 2
## 2 0.0184982 NA
## 3 0.0184982 0.9159568
## diff lwr upr p adj
## 2-1 -8.8555145 -14.417275 -3.293754 0.0006293511
## 3-1 -8.4750918 -13.945595 -3.004589 0.0009161656
## 3-2 0.3804226 -5.470231 6.231077 0.9871323318
teach_agg = my_star('g3tmathss', 'g3classtype', 'g3schid', 'g3tchid')
## g3tchid g3tmathss g3schid g3classtype classtypevar schoolvar mathvar
## 1 11203810 614.6400 112038 3 3 112038 614.6400
## 2 11203811 608.5417 112038 2 2 112038 608.5417
## 3 11203812 581.3077 112038 1 1 112038 581.3077
## 4 12305610 633.4375 123056 2 2 123056 633.4375
## 5 12305611 613.1000 123056 1 1 123056 613.1000
## 6 12305612 633.6364 123056 3 3 123056 633.6364
fit = aov(teach_agg$mathvar ~ teach_agg$classtypevar + teach_agg$schoolvar);
boxcox(fit)
new_fit = aov(log(teach_agg$mathvar) ~ teach_agg$classtypevar + teach_agg$schoolvar)
boxcox(new_fit)
# do this one custom
teach_agg$g3tmathss = log(teach_agg$g3tmathss)
my_star_stat(teach_agg, 1, new_fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## teach_agg$classtypevar 2 0.01193 0.005966 9.555 9.94e-05 ***
## teach_agg$schoolvar 74 0.25823 0.003490 5.589 < 2e-16 ***
## Residuals 257 0.16045 0.000624
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 2
## 2 0.05785374 NA
## 3 0.01377987 0.6432982
## 1 2
## 2 0.08678061 NA
## 3 0.01377987 1
## 1 2
## 2 0.04339030 NA
## 3 0.01377987 0.6432982
## diff lwr upr p adj
## 2-1 -0.010646642 -0.01868794 -0.002605348 0.0056764665
## 3-1 -0.013048919 -0.02061253 -0.005485307 0.0001867896
## 3-2 -0.002402276 -0.01090555 0.006100994 0.7834295503
## 75% of the sample size
#smp_size <- floor(0.75 * nrow(teach_agg)))
## set the seed to make your partition reproducible
#set.seed(123)
#train_ind <- sample(seq_len(nrow(teach_agg)), size = smp_size)
#train <- teach_agg[train_ind, ]
#test <- teach_agg[-train_ind, ]